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ABSTRACT 

Wireless sensor networks (WSNs) have attracted considerable attention in recent years and motivate a host of 
new challenges for distributed signal processing. The problem of distributed or decentralized estimation has often 
been considered in the context of parametric models. However, the success of parametric methods is limited by 
the appropriateness of the strong statistical assumptions made by the models. In this paper, a more flexible 
nonparametric model for distributed regression is considered that is applicable in a variety of WSN applications 
including field estimation. Here, starting with the standard regularized kernel least-squares estimator, a message- 
passing algorithm for distributed estimation in WSNs is derived. The algorithm can be viewed as an instantiation 
of the successive orthogonal projection (SOP) algorithm. Various practical aspects of the algorithm are discussed 
and several numerical simulations validate the potential of the approach. 

Keywords: wireless sensor networks, distributed estimation, regression, alternating projections, kernel methods, 
j 1 , distributed learning, nonparametric 
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,<->,' 1. INTRODUCTION 

1.1. Motivation 

Wireless sensor networks have attracted considerable attention in recent years [1]. Research in this area has 
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focused on two separate aspects of such networks: networking issues, such as capacity, delay, and routing 
strategies; and applications issues. This paper is concerned with the second of these aspects of wireless sensor 
networks, and in particular with the problem of distributed inference. Wireless sensor networks are a fortiori 
designed for the purpose of making inferences about the environments that they are sensing, and they are 
typically characterized by limited communications capabilities due to tight energy and bandwidth limitations. 
Thus, distributed inference is a major issue in the study of such networks. 

The problem of distributed or decentralized estimation has often been considered in the context of parametric 
models, in which strong assumptions are made about a statistical description of the environment. In [2], for 
example, the observations made by a sensor network are modeled by a Gaussian random field whose dependency 
structure is described by a graphical model; in such a setting, iterative message passing algorithms such as 
belief propagation (BP) [3], loopy BP [4], embedded trees [5], embedded polygons [2], and Gaussian elimination 
[6] can be used to efficiently estimate the field distributively. [7] investigates a distributed EM algorithm for 
estimating a field modeled as a mixture of Gaussians. In the case where sensors observe the same phenomenon 
through independent noisy channels, [8] studies a distributed maximum likelihood estimator using an information 
sharing algorithm based on Fisher scoring. [9, 10] derive distributed algorithms based on incremental subgradient 
methods that are useful for parameter estimation in sensor networks. Finally, [11] focuses on the case when sensor 
observations are quantized to one-bit. 

As in the standard centralized case, the success of distributed parametric methods is limited by the ap- 
propriateness of the statistical assumptions made by the model. In certain applications, such strong modeling 
assumptions are warranted and systems designed from these models show promise. However, in other scenarios, 
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prior knowledge is at best vague and translating such knowledge into a statistical model is undesirable. Appli- 
cations such as these pave the way for a nonparamctric study of distributed estimation. To this end, several 
nonparametric analyses have considered the distributed estimation problem. For example, weakening the as- 
sumptions made in [11], [12] focuses on universal decentralized estimation in the case when sensor observations 
are quantized to one bit. [13, 14] study the existence of consistent estimators in several models for distributed 
learning. 

In this paper, motivated in part by the success of (reproducing) kernel methods in machine learning [15], we 
consider a nonparamctric approach to distributed estimation that is based on regularized kernel least-squares 
regression. In particular, we assume that a network of sensors is distributed about a plane and that each sensor 
makes a local observation of a (say, temperature) field. If every sensor's position and field measurement are avail- 
able at a central fusion center, then the regularized least-squares estimator is a standard nonparametric estimate 
of the field (at locations perhaps not occupied by sensors) . More precisely, assuming that prior-knowledge about 
spatial correlations in the field can be encoded in a positive definite kernel, a standard centralized approach 
to field estimation is to minimize a regularized empirical loss functional over functions in the corresponding 
reproducing kernel Hilbcrt space [15]. 

Of course, due to limitations on energy and bandwidth, the sensors' measurements and positions are not 
available at the fusion center and therefore, such centralized estimation schemes are not feasible in wireless 
sensor networks. In this paper, however, we suggest a distributed algorithm for computing an approximation. 
To arrive at the approximation, we first interpret the classic estimator as a projection onto a Hilbert space 
and then consider a natural relaxation that is derived from the topology of the sensor network. The relaxation 
suggests a local message-passing algorithm based on the SOP algorithm (successive orthogonal projection) [16] 
for distributively estimating the field. Through several numerical experiments, it is observed that the algorithm 
converges after a reasonable number of iterations, that the accuracy of the distributed estimate is improved 
through the local message-passing algorithm, and that the estimate is comparable with the centralized kernel 
least-squares estimator. 

1.2. Related Work 

The work presented here is most similar to the research presented in [17], where a general framework for dis- 
tributed kernel linear regression is explored. The network model is essentially identical to the one considered 
there and thus, it is useful to contrast the methods. First, whereas the current work is cast formally within the 
context of reproducing kernel Hilbcrt spaces, [17] takes a weight-space view and attempts to estimate the field 
as a linear combination of local basis functions. Secondly, in [17], the algorithm is based on a distributed imple- 
mentation of Gaussian elimination using junction trees. In the current approach, a message-passing algorithm 
is derived from the SOP algorithm and a network topology dependent relaxation of the centralized least-squares 
estimator. Finally, in this paper, regularization is incorporated into the model. 

Alternating projection algorithms are often applied in other areas of statistical signal processing, for example, 
restricted least-squares regression [18] and computerized tomography [19]. In [20], an alternating projection 
approach was taken to the problem of localization in sensor networks. However, to our knowledge, such techniques 
have not been applied more generally to the problem of distributed, communication-constrained estimation. 

Alternating projection methods have often been used in algorithms for parallel optimization [16]. Algorithms 
similar to those presented in this paper may be useful to help circumvent the complexity induced by massive data 
sets in machine learning, possibly by parallelizing kernel methods. To our knowledge, SOP (or other alternating 
projection algorithms) have not been applied in this related context. 

Given the connection between kernel methods and Gaussian processes [21], those familiar with message- 
passing algorithms such a belief-propagation may find the algorithm presented here familiar. Formalizing such 
a connection would likely require one to interpret our "relaxation" in the context of dependency structures in 
Gaussian processes, and to connect alternating projection algorithms with the generalized distributive law [22]. 



1.3. Organization 

The remainder of the paper is organized as follows. In Section 2, we introduce notation and then review the 
SOP algorithm and regularized least-squares regression in reproducing kernel Hilbert spaces. In Section 3, we 
formalize the sensor network model and derive an algorithm for distributed estimation in sensor networks. In 
Section 4, we summarize several experiments with simulated data where the algorithms are evaluated. Finally, 
conclusions, extensions, and future work are discussed in Section 6. 

2. PRELIMINARIES 

2.1. Alternating Projections Algorithms 

Let X be a Hilbert space with a norm denoted by || • ||. Let C\, . . . , C n be closed convex subsets of X with a 
nonempty intersection C — fl™ =1 Cj. Let P c denote the orthogonal projection of x E X onto C, i.e., 

Pc(x) = argmin \\x — x\\. 

Define Pc t analogously. The following question arises: can Pc() be computed using {-Pc;( - )}?=i? Iteratively 
defined through (1), the (unrelaxed) SOP (successive orthogonal projection) algorithm is a natural approach: 

x :=x x n :=P Cn modm+1 0„_i). (1) 

In words, SOP successively and iteratively projects onto each of the subsets; much of the behavior of this 
algorithm can be understood through the following lemma, whose proof can be found in [23] (see Remark 6.19). 

Lemma 2.1. Let \Ci\™ =1 and C be as above. Let x n be defined as in (1). Then, for every x G C and every n, 

\\X„ - x\\ < \\Xn-! - x\\. 

Moreoever, lim^oo x n e fl-^Cj. Lf d are sub spaces for all i e {1, ...,m}, then lim^oo \\x n — Pc{x)\\ = 0. // 
the angle between the sets is strictly positive, then convergence is linear. 

This algorithm is often studied in the context of the convex feasibility problem and has a wide-range of signal 
processing applications (e.g., image reconstruction). Having been generalized in various ways [16], SOP often 
takes on other names (e.g., Bregman's algorithm). Below, we comment on how these generalizations may be 
useful in practical settings. For the sake of space, we forego a thorough examination of alternating projection 
algorithms and their applications and refer the reader to standard references on the topic [16,23]. 

2.2. Regularized Kernel Least-squares Regression 

Reproducing kernel Hilbert spaces are often considered in statistical signal processing and have been popularized 
within the machine learning community by the success of kernel methods. In this section, we briefly review 
regularized kernel least-squares regression in order to anchor our notation. For a more thorough introduction, 
wc refer the reader to various references on the topic; see, for example, [15, 24] and references therein. 

Let X and Y be X and ^-valued random variables, respectively. X is known as the feature, input, or 
observation space; y is known as the label, output, or target space. In the sequel, we take X = R 2 to model 
the positions of sensors in the plane and we take y = H to model real-valued sensor field measurements. In the 
least-squares estimation problem, we seek a decision rule mapping inputs to outputs that minimizes the expected 
squared error. In particular, we seek a function g : X — > y that minimizes 

E{| 5 (x)-y| 2 }. (2) 

It is well-known that r]{x) = E{Y \X = x} is the loss minimizing rule. However, without prior knowledge of the 
joint distribution of (X, Y), this regression function cannot be computed. In the supervised learning problem, 
one is instead provided a sequence of input/output training examples, S — {(xi, yi)}f =1 C X x y-, the learning 
task is to use S to estimate rj(x). Regularized kernel least-squares methods offer one approach to this problem. 



In particular, let Hk denote the reproducing kernel Hilbert space (RKHS) induced by a positive semi- definite 
kernel K(-, •) : X x X — > JR. More precisely, we can associate with any positive definite function K (•, •) a unique 
collection of functions Hk such that 

K t = K(-,t) G H K VteX 

m 

Y,^ K U G Hif V{ a ,}r =1 cR,m<oo. (3) 
i=\ 

If we equip Hk with an inner-product defined by < K s , K t >= K(s, t), extend Hk using linearity to all functions 
of the form (3), and include the point-wise limits, then Hk is called an RKHS. Note that 

<K x ,f>=f(x), 

for all x G X and all / G Hk', this identity is the reproducing property from which the name is derived. Finally, 
let II ' \\hk denote the norm associated with Hk- 

Example 1. When X = H d , a canonical RKHS is the one associated with the linear kernel, if (x,x') = x T x'. 

Example 2. When X = TR d , a second canonical RKHS is the one associated with the Gaussian kernel, K(x, x') = 
exp~H x ~ x "a . 

In the numerical studies in Section 4, we will explore examples using both the linear and Gaussian kernels. 

Returning to our discussion of least-squares estimation, let us apply this RKHS formalism to the regression 
problem. Given a positive semi-definite kernel K, which is often designed as a similarity measure for inputs, 
the regularized kernel least-squares estimate is defined as the solution fx G Hk of the following optimization 
problem: 

n 

min [£(/(a^-y0 2 + A||/||U. (4) 

1=1 

The first term in objective function (4) is a measurement of how well the rule "fits the data" ; the second term acts 
as a complexity control. A G 1R is a constant parameter that governs the trade-off between these two terms. In the 
learning context, the statistical behavior of this estimator is well-understood under various assumptions on the 
stochastic process that generates the examples {(xi, yi)}? =1 [15,25]. Moreover, this highly successful technique 
has been verified empirically in applications ranging from bioinformatics to hand-written digit recognition. 

For those unfamiliar with kernel methods, let us note in passing that this estimator has a useful variational 
interpretation in the context of Gaussian process estimation [21]. For example, one may posit a statistical model 
such as 

Ui = fj,(xi) + ei, 

and assume that (^i(x)) xe x is zero-mean Gaussian process with covariance function k(x, x 1 ) and that {ej}™ =1 is 
independent and identically distributed (i.i.d.) zero-mean Gaussian noise. Under these assumptions, f\(x) = 
E{ M (x) I S}. 

In this paper, we focus on algorithmic aspects of computing a solution to (4) (or an approximation thereof) in 
distributed, communication-constrained environments like sensor networks. To this end, consider the following 
"Representor Theorem" proved originally in [26]. The result is significant because it states that while the 
objective function (4) is defined over a potentially infinite dimensional Hilbert space, its minimizer must lie in a 
finite dimensional subspace. 

Theorem 2.2 ([26]). Let f\ e Hk be the minimizer of (4). Then, there exists c\ G 1R" such that 



n 

fx(-) = ^c x ,iK{-,Xi). 



(5) 



After plugging (5) into (4), one can derive that 



c x = (K + XI)- 1 y, (6) 

where K — (kij) is the kernel matrix (kij = K(xi,Xj)). In practice, the matrix inversion in (6) is often avoided 
by using algorithms for solving linear systems; often, sparsity structure in K can be exploited. In the learning 
literature, the process of computing c\ given S is often called training, whereas testing generally refers to the 
process evaluating the field estimate f\(x) — Ym=i c \,%K(x, xi) for a new input x. 

Finally, note that (4) can be interpreted as an orthogonal projection. In particular, by introducing an auxiliary 
vector z e H™, (4) can be rewritten as the following optimization program: 

min ||z-y||! + A||/||^ (7) 
s.t. Zi = f(xi) Vie{l,...,n} (8) 

zeR" 
/ e H K . 

Through the constraints in (8), the optimization problems are equivalent in the following sense: if fx is the 
minimizcr of (4) and (z',f' x ) is the solution of (7), then f' x = fx- Therefore, through (7), we can interpret the 
regularized kernel least-squares estimator as a projection of the vector (y, 0) 6 R" x Hk onto the set 

{(zJ)eR"x% : z i = f{x i )Vi£{l,...,n}}c-R n xH K . 
This simple observation will recur in the sequel. 



3. DISTRIBUTED LEAST-SQUARES REGRESSION 

3.1. The Model 

Consider a network of n sensors that are distributed about the plane R 2 ; let {x 4 }" =1 C H 2 denote the coordinates 
of the sensors' positions. Assume that each sensor can accurately localize itself in the plane, i.e., assume that for 
alH e {1, n}, sensor i knows Xj ; this assumption is justified by the existence of various localization algorithms 
for wireless sensor networks (e.g., [20]). In what follows, we will interchangeably refer to the i th sensor in the 
network by its index, i, or its position, Xj. 

Suppose that the sensors form a wireless ad-hoc network with a topology described by a graph; for example, 
consider the topology depicted in Figure 1. Each node in the graph represents a sensor in the network; an 
edge between two nodes posits the existence of a point-to-point (wireless) communication link between the 
corresponding two sensors. Let Ni C {1, n} denote the set of neighbors for sensor i. Interpret JVf as the set of 
sensors with which sensor i can communicate directly. Consistent with this interpretation, let us abuse notation 
slightly and assume that i £ N for every i 6 {1, ...,n}; in other words, each sensor can communicate with itself. 
We make no additional assumptions on the structure of the graph (i.e., we do not require it to be connected 
nor to form a spanning tree, etc.). However, the performance of the algorithms will indeed be effected by such 
properties and we comment on such relationships below. 

Note that in this model, the capacity of communication links in the network (i.e., edges in the graph) is 
not modeled explicitly. Instead, we assume that each link can support the simple messages to be passed by 
our algorithm; this assumption is consistent with other applications of message-passing algorithms (e.g., belief- 
propagation) to sensor networks. 

After the network has been localized, assume that each sensor makes a noisy estimate of a (say, temperature) 
field at its position; let y t denote the measurement of sensor i. One may posit a relationship between the sensor's 
measurement and its position. For example, one could assume that 



Ui = temperaturc(xi) + rij, 



(9) 



Figure 1. Sample Sensor Network Topology 



for some Gaussian random field (temperature(x)) xgR 2 , and an i.i.d. zero-mean noise sequence {rii}" =1 . However, 
such assumptions are not required in this model for nonparametric regression. 

Assume that a centralized fusion center wishes to use the sensors' local measurements to infer the field across 
the plane. In other words, the fusion center would like to determine a function / : H 2 — > II such that the 
expected squared error, 

E{|/(X) - temperature(X)| 2 }, 

is small. Here, the expectation could be with taken respect to any probability distribution defined on H 2 ; for 
instance, the distribution may model the mechanism through which sensors are randomly positioned in the plane. 

If the fusion center had access to every sensor's position and temperature measurements, the regularized 
least-squares estimator is a natural choice. Using the formalism of Section 2.2, one may take X — 1R 2 to model 
the positions of sensors in the plane and take y = Hi to model the sensors' real- valued field measurements. After 
choosing a kernel K(-, •) to reflect the designer's prior knowledge of spatial correlation in the field, the fusion 
center may estimate the field as fx, the solution to 

n 

min [£(.f(x 4 )-^) 2 + A||/||U. (10) 

1=1 

Here, as in Section 2.2, Tin denotes the RKHS generated by the kernel K. Of course, due to constraints on 
energy and bandwidth, the fusion center will not have access to any of the data in {(^i,Ui)}f =1 and therefore 
(10) cannot be solved using centralized methods; in general, classical kernel methods are infeasible for the sensor 
network. Moreover, even if an algorithm exists to solve (10) distributively, the sensors may not be able to share 
the estimated model (i.e., fx) with the fusion center, also due to the complexity of communicating. In short, 
our assumption is that the tight communication constraints preclude both centralized training and centralized 
testing in this model for distributed regression. 

In what follows, we show how the optimization problem in (10) can be relaxed for the problem of distributed 
regression. The relaxation will be derived in light of the topology of the sensor network and will suggest an 
algorithm for distributively estimating the field. In the algorithm, each sensor will communicate only with 
neighboring sensors and each sensor will locally determine its own global estimate for the field. In addition, we 
will suggest several algorithms for aggregating the sensors' estimates at the fusion center. 

3.2. An Algorithm for Distributed Regression 

To begin, let us associate a function /, 6 Hk and a parameter A; £ R with sensor i . Now, consider the following 
optimization program: 

min Eti(M^)-m) 2 + Eti^\\M\ 2 H K (ii) 

s.t. /^xO = /i(xi) Vi,je{l,...,n} (12) 

fi€H K V*e{l,...,n} 



Here, the optimization variables are C Hk] {( x ijJ/i)}™ = i C H 2 x II and {Ai}" =1 C H are the program 

data. The coupling constraints in (12) dictate that for any feasible solution to (11), every sensor's associated 
function is equivalent when evaluated at {xj}™ =1 . As a result, one can think about (11) as an equivalent form of 
(10) in the following sense. 

Lemma 3.1. Let (f\ 1 , ■■■,f\ n ) G H K denote the solution of (11) and let f\ G Hk denote the solution of (10). 
Assume that Xi > Vi G {1, n). Then, f\ 1 = ■ ■ ■ = f\ n . IfY^i=i ^» = ^> then f\ = f\ ± . 

This form of the regularized least-squares regression problem suggests a natural relaxation that allows us 
to incorporate the communication model of the sensor network into the estimator. In particular, we relax the 
coupling constraints as follows: 

mm Er=i(/i(^)-W) 2 + E^iAi||/i||^ (13) 
s.t. / J -(x i ) = / i (x i ) VjeiVi,ViG{l,...,n} (14) 

fi£H K Vie{l,...,n} 

In comparison to (11), the coupling constraints in (13) require two sensors' functions to be equivalent only when 
evaluated at shared neighbors' locations. More precisely, for any feasible solution to (13), /i(xfe) = /j(xfc) if 
k G NiPiNj. Since we have assumed i G Ni for all i G {l,...,n}, this interpretation is equivalent to the 
constraints stated formally in (14). 

Next, note that just as (10) can be interpreted as a projection via (7), (13) can be interpreted as a (weighted) 
projection of the vector (y, 0, . . . , 0) G H" x H K onto the set C = n" =1 Cj, with 

Ci = {(z, A,... , /„) : z G IT, fi G Hk, and / t (x,) - Zj Vj G N t } C R n x H\. (15) 

This fact is straightforward to derive using the same manipulation applied in Section 2.2. The significance of 
this observation lies in the fact that the relaxed form of the regularized kernel least-squares estimator has been 
expressed as a projection onto the intersection of a collection of n convex sets; in particular, note that each set 
d is a subspace. Thus, the SOP algorithm can be used to solve the relaxed problem (13). Moreover, computing 
Pc'i(-) requires sensor i to gather information only from its neighbors in the network. More precisely, note that 
for'any v = (z,/i, . . . , /„) G M" x H' K , P Ci (v) = (z*, /*) where 

/* = fi Vj^i 

ft = arg min ^ (/(xj) - Zi f + Xi\\f - fif HK 

zj = zj Vj i Ni 

z* = f*(*j) VjGNi 

To emphasize, computing -Pc;( v ) leaves Zj unchanged for all j ^ Ni and leaves fj unchanged for all j ^ i. The 
function associated with sensor i, f* can be computed using {zj}j£N t and /j. Thus, tying these observations 
together, we are left with an algorithm for distributively estimating the field by solving a relaxed form of the 
regularized least-squares estimator (13). The algorithm, SN- Train, is summarized in psuedo-code in Table 2 and 
depicted pictorially in Figure 2. 

Before proceeding with a discussion of the algorithm, let us briefly mention two important properties. First, 
note that the asymptotic behavior of SN- Train is implied by the analysis of the SOP algorithm. In particular, 
we have the following. 

Lemma 3.2. Let (f\ 1 , . . . , f\ N ) e H K be the solution to (13) and let {/ s ,t}" = i C Hk be as defined in SN-Train. 
Then, 

lim f SiT = f Xs (16) 

1 —>oc 

for all s £ {1, . . . , n}. 



Initialization: Kernel update: sensors notified which kernel to use. 

Sensor localization: sensor s determines x s 
Neighbors share locations: sensor s stores {*j}jeN s 
Sensor s initializes z sfi = yi, / Si o = G TLk 



Train: 



fort=l,...,T 

for s = 1, . . . ,n 
Sensor s: 



:ueries z^t-i Vj e N s 

r E 3 eN s (f(^) - z jtt -i) 2 + KWf fs,t-i\\ 2 HK _ 



end 



end 



/ S)t := argmin /eHK J2 3 eN s (f l 
Updates z jtt = f s A x j) ^3 <= N > 



Table 1. SN- Train 




Figure 2. Depiction of Training phase of SN-Train 



This lemma follows from Lemma 2.1 and the fact that convergence in norm implies point- wise convergence 
in an RKHS. Given the structure of RKHS and the general analysis in [23], SN- Train is expected to converge 
linearly for many kernels. We forego a discussion of this important, but technical point for the sake of space. 

Observe that Lemma 3.2 characterizes the output of SN- Train relative to (13). This characterization is useful 
insofar as it sheds light on the relationship between the output of SN- Train and (10), the centralized regularized 
least-squares estimator. The following straightforward generalization of Theorem 2.2 is a step toward further 
understanding this important relationship. 

Lemma 3.3. Let (/a i; . . . , f\ n ) 6 TL 1 ^ be the solution to (13) . Then, for every sensor s G {1, . . . , n}, there exists 
c Xs G H |JVs such that 

/*.(•)= X>A.,i#(-,x i ). (17) 

i<EN s 

The proof of this lemma follows from the original Representer Theorem (applied to the update equation for 
f s ,t) and the fact that TL 7 ^ is closed. The significance of the result lies in the fact that connectivity fundamentally 
limits the accuracy of any sensor's global estimate of the field. In particular, a sensor connected to few sensors 
in an otherwise dense network will be limited to estimates that lie in the span of functions determined by its 
neighbors; thus, connectivity influences the bias of the network. Intuitively, however, local message-passing in 
SN- Train should aid in optimizing the estimator within that limited span as long as the network is connected; 
this intuition is borne out in experiments in Section 4. 

3.3. Discussion 

Now, let us consider several issues relevant to the practical implementation of SN- Train. 

Communication In the training phase of the SN- Train algorithm, communication occurs only between neigh- 
boring sensors during the Query and Update stages of the iterative process. In the Query stage, the 
message from sensor i to sensor j is an estimate of the field at sensor i; in the Update stage, the message 
from sensor i to sensor j is sensor i's "new" estimate of the field at sensor j. To emphasize, though each 
sensor retains a locally estimated function, the messages passed through the network are not functions, but 
real numbers that represent the network's estimate of the field at the sensor locations. Moreover, though 
only local information is exchanged, global information is conveyed. In particular, there are examples of 
RKHSs paired with very sparse network topologies for which the output of SN- Train is equivalent to the 
centralized estimator. 

Computation In training phase of the SN- Train, computation occurs when sensor s updates / Sjt . By a mild 
generalization of Theorem 2.2 [15], f s ,t( m ) = J2jeN s c s,t,jK(-,xj) for some vector c Sit G R,'^'. By substi- 
tuting this representer into the update equation for / S;t in SN- Train, the update rule can be derived in 
terms of c Sjt as follows: 

c a , t - {K„ + A a /) _1 (z a , t _i + A s c s , t _i), (18) 

where z Sjt _i G II'^ 3 is appropriately defined to be the vector of measurements that sensor s receives during 
the Query stage. Note that in principle, sensor s can compute c s ^ efficiently if the number of neighbors is 
reasonably small, as is anticipated in sensor networks. Also note, assuming that the field is slowly varying 
(i.e., "neighboring sensor observations are correlated") and the sensor network topology is "local", then 
the kernel matrix K s is not likely to be sparse. 

Parallelism As described in Table 1, the inner loop of SN- Train iterates over the sensors in the network 
serially. Note this ordering is non-essential and parallelism may be introduced. In fact, two sensors can 
train simultaneously as long as they do not share an neighbors in the network (as depicted in Figure 2). 
In practical settings, various multiple-access algorithms (e.g., ALOHA) may be adapted to negotiate an 
ordering in a distributed fashion. Since the SOP algorithm and Lemma 2.1 have been generalized for a 
very general class of (perhaps random) control ordcrings [23], Lemma 3.2 can be extended in many cases. 



Localization In the initialization phase of SN- Train, sensors are assumed to store the positions of neighboring 
sensors. What is actually crucial is that each sensor s stores an evaluation of its local kernel matrix 
K s = [kij]ijtN s € ]Rl JVs l x l JVs l ) which is needed at each update of f Stt (see the comments on Computation, 
above). However, since many popular kernels depend only on the distance between sample points (e.g., 
kij = cxp~H Xi ~ Xi ll 2 for the Gaussian kernel), localization may be overkill. It may be sufficient for each 
sensor to estimate relative distances between all neighboring sensors. 

Storage As mentioned in the comments on localization, it is crucial that each sensor s stores an evaluation 
of its local kernel matrix K s . Moreover, each sensor must store a representation of f s ,t- By (18), it is 
sufficient to store an additional \N S \ coefficients. Thus, the storage requirement of a sensor is influenced 
entirely by how many neighbors it has in the network. In short, sensor s must maintain 0(|A S | 2 ) floating 
point numbers. 

Centralized special case Note that it may be useful to think of this model for distributed regression as a strict 
generalization of the centralized kernel linear regression framework. In particular, the standard kernel linear 
regression model corresponds to a fully connected network in the sense described by Lemma 3.1. 

Robustness The SN- Train algorithm can be made robust to a changing network topology, for example due 
to sensor failures, etc. For the sake of space, we will not elaborate the point here, but note in passing 
that SN- Train can be adapted to allow the neighborhood N s j of sensor s to be a function of time. The 
resulting algorithm is essentially identical to SN- Train (allowing for necessary localization information to 
be exchanged midway). Moreover, the algorithm converges to the solution implied by the largest stationary 
neighborhood that occurs "infinitely often" and in a well-defined sense, "progress" is made at each iteration 
(even when the "current" neighborhood does not correspond to the largest). An analysis of this robust 
approach requires several of the SOP generalizations discussed in [23]. 

Aggregation As discussed above, it may be unreasonable to assume that the sensors can share the output of 
the distributed training algorithm, {/ s ,t}"=i, due to the complexity of communication in wireless sensor 
networks. For the sake of the subsequent experimental studies, we will assume that when the fusion center 
wishes to find the temperature at x G R 2 , it may employ one of the following three strategies to aggregate 
the response of the network. 

• Single Sensor The fusion center will simply choose one sensor arbitrarily, say sensor s, and use it 
for all future field estimates. Note that such a rule is feasible, since in SN- Train, each sensor locally 
derives global estimate / Sj x for the field. 

• k Nearest-neighbor The fusion center will average the estimates provided by the k sensors nearest 
x. If a C {1, n} is the set of the k sensors nearest x, then fusion rule will aggregate according to 
the following rule: 

/fc-NN(x) = ^/ 8 ,r(x). (19) 

Both the simple nearest-neighbor and simple network average are special-cases of the k nearest- 
neighbor fusion rule, corresponding to k = 1 and k = n, respectively. 

• Connectivity-averaged The fusion center take a weighted average of the estimates provided by all 
n sensors. The weights in the average will be determined by how connected the sensors are in the 
network. In particular, the fusion center will consider the following aggregation strategy: 

1 " 

/ca(x) = ]T|A s |/ s , T (x). (20) 

2^s=l H V sl s= i 

4. EXPERIMENTS 



In this section, we study SN- Train numerically through several simulations. Indeed, each of the issues discussed 
in Section 4 requires further examination. However, due to limited space, we focus our attention. First, we study 
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the convergence rate of SN- Train. In particular, we study the number of outer iterations (i.e., T) needed for the 
expected error of fusion center's estimate (when using the various fusion rules discussed in Section 4) to converge. 
Since even local communication can be a burden on power and delay if the number of necessary iterations is 
excessive, we hope to quantify this aspect of our iterative approach. In a second study, we examine the quality 
of the estimate derived through SN- Train. Since any algorithm for distributed estimation is ultimately validated 
by the accuracy of the estimate, we examine the effect that local communication has on estimation quality. 

4.1. Setup 

In our model description in Section 3, we have assumed that sensors are distributed physically in the plane. 
However, the algorithm and results hold for any Euclidean* space (e.g., M d for d ^ 2). Thus, to simplify our 
study, we take d = 1 and assume that n sensors are distributed uniformly along the interval X = [—1,1] CE 

In what follows, we model the observations made by the sensors as, 

Vi = r){xi) + rii, (21) 

where {nj}™ =1 is an i.i.d. sequence of zero-mean Gaussian random variables with variance a 2 . We consider two 
cases that differ primarily in the way we choose ?y(-)- To keep the problems interesting, we also modify a and K 
(the kernel employed by the sensors) according to our selection of rj(-). In particular, we consider the following 
two cases. 

Case 1: ri(x) — r)i(x) = 5x + 5,a = 7,K = linear kernel. 
Case 2: ri(x) — 772(2:) — sin(7ra;), a = 1, K = Gaussian kernel. 

For convenience, Case 1 and Case 2 are contrasted pictorially in Figure 3. A random sampling of n = 50 
sensor positions is depicted along the lower horizontal border of each plot; a corresponding sampling of sensor 
observations is plotted as well. In Case 1, the regression function is linear and hence, local information is useful 
in constructing global field estimates. In Case 2, the regression function has been constructed so that local 
information is comparatively less useful in constructing global field estimates. As discussed below, this difference 
will influence the performance of SN- Train in various ways. In what follows, we examine both convergence and 
estimation quality in experiments using Case 1 and Case 2. The parameter r > is used to define the network 
topology for a random sensor network. In particular, two sensors are neighbors in the network if the (Euclidean) 
distance between them is less than r. Finally, somewhat arbitrarily, we select the regularization parameters so 
that Xj = jj^jt, with k = 0.01 for all i 6 {1, . . . , n}. 

'Actually, since kernel methods can be derived for any input space for which a reproducing kernel Hilbert space 
embedding can be denned, our approach is broadly applicable. 
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4.2. Convergence 

To investigate the convergence rate of SN- Train, we vary T from 1 to 100. For each T, the regression function 
is randomly sampled 500 times; the test error of the single sensor, nearest neighbor, and connectivity-averaged 
fusion rule is measured on this held out test set after the network has been trained distributively with SN- Train. 
The results for Case 1 and Case 2 are plotted in Figures 4 and 5, respectively. On the left, we randomize over 
S = 200 random sensor positions and noise sequences and plot the average test error vs. T; on the right, a plot a 
typical output estimate under each of the fusion rules. For perspective, the error rate of the standard centralized 
kernel least-squares estimator is plotted as well. Note the error rate plots are logarithmic on the dependent axis. 

Observe that for both Case 1 and Case 2, convergence is reasonably fast for each of the three fusion rules. In 
particular, note that the nearest-neighbor rule converges after two to three outer iterations. Further, note that 
the nearest- neighbor fusion rule ultimately out-performs the other aggregation strategies, and is competitive with 
the centralized estimator. Moreover, the single sensor fusion rule is quite poor; despite the fact that each sensor 
retains a global field estimate, the bias of the estimate is constrained by Lemma 3. It interesting to note that 
the single sensor fusion is more competitive in Case 1 than in Case 2. Since for linear rules, global information is 
useful for local estimates, individual sensors benefit from receiving local messages that are derived from distant 
sensors. 

We note in passing that these observations are fundamentally tied to the simulated relationship between 
the network topology and the kernels chosen. In particular, in these experiments, neighboring sensors can 
communicate and can expect to have correlated (i.e., similar) field measurements. For instance, the convergence 
rate is not as fast and the nearest neighbor fusion rule is in general not superior when a "random" network 
topology is paired with a Gaussian kernel. In the interest of space, we do not explore this important point in 
this paper. 

4.3. Communication 

Since SN- Train adds complexity to the estimation algorithm, it is important to verify that this expense buys 
improved estimation accuracy. To do so, we contrast training distributively with SN- Train with a "local only" 
approach. In the local only approach, we make only one iteration through the network, avoiding the Update 
step in SN- Train. That is, each sensor computes a global estimate for the field using the samples observed in 
its neighborhood; however, information about these estimates is not exchanged through the message variables. 
Since the only difference between using SN- Train and "local only" is the Update step, the value of communication 
has been isolated. For perspective, we also construct the standard centralized least-squares estimator estimator 
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(with A = t^) and plot its error rate which is independent of r. (i.e., we assume all the samples are centrally 
located) . 

In these experiments, we select T = 200 to ensure that SN- Train has converged. The estimation accuracy 
is measured through the expected squared error; in particular, we randomly sample the regression function 300 
times and measure the test error empirically on this held out test set. We study the test error at various levels 
of connectivity of the network. In Case 1, we vary r from 0.1 to 0.6 in increments of 0.05 and in Case 2, we vary 
r from 0.1 to 2.1 in increments of 0.1. For each r, we randomize over S = 300 random sensor positions and noise 
sequences, and plot the average test error versus r for SN- Train, local only, and the centralized approach. We 
focus on the single sensor fusion rule, which allows us to compare how the local message passing influences each 
sensor's global estimate. 

Intuitively, as r grows, we expect the error of local only approach to decrease, since each sensor has a greater 
number of neighbors. The bias of the estimate should decrease since by Lemma 3, the sensors' estimates lie in 
the span of a larger set of basis functions; the variance should decrease since a greater number of samples are 
incorporated to average out the noise. A priori, such trends are not expected for SN- Train. 

The results for Cases 1 and 2 are plotted in Figure 6. Note, for clarity, we plot the test error in Case 2 loga- 
rithmically^. For Case 1, our expectations are confirmed for the local only approach; the test error decreases with 
increased connectivity. Moreover, the estimate derived from SN- Train also improves as connectivity increases. 
For Case 2, the test error decreases with connectivity for SN- Train, but not for the local only approach. We at- 
tribute this behavior to the fact that we are studying the single sensor fusion rule. For low levels of connectivity, 
individual sensors have very little information that is useful for estimating the field outside their neighborhood. 
Thus, for the sinusoidal regression function, the local only approach incurs large test error when its performance 
is assessed "globally" . Interestingly, the local communication in SN- Train significantly improves over this naive 
approach; while both techniques are fundamentally constrained by Lemma 3, the local message passing in SN- 
Train leads to a better estimate for each fixed level of connectivity. Further, note that after a certain degree 
of connectivity, the estimate derived from SN- Train is competitive with the centralized approach. The level of 
connectivity required for the local only approach to be similarly competitive is larger for Case 2 as compared to 
Case 1; for Case 2, the network is nearly fully connected before the local only approach is competitive with the 
centralized estimator. 

^This operation exacerbates the variance of our estimate; whereas the test error for the centralized approach should 
be independent of r, variance is observed in Figure 6. 
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Figure 6. Connectivity: Case 1 and Case 2 

5. CONCLUSIONS, EXTENSIONS, AND FUTURE WORK 

5.1. Conclusions 

There are several contributions of the current paper to the literature on distribution estimation. First, though 
kernel methods were advocated in [27] and studied in the context of distributed detection in [28] , to our knowl- 
edge regularized reproducing kernel methods have not been previously explored in the context of distributed 
estimation. Second, alternating projection algorithms are added to the list of iterative algorithms that can be 
usefully deployed for problems of distributed inference in sensor networks. Finally, a host of new statistical 
questions arise about the estimator, to be formalized in future work. 

5.2. Extensions 

The SOP algorithm has been generalized to a wide variety of non-orthogonal projections including Bregman 
projections [16]. Thus, using this generalization, similar algorithms can be developed for distributed estimation 
in sensor networks for other loss functions (i.e., non-squared error) and more general regularizers. In particular, 
the algorithms in the current paper can be generalized to handle loss functions and regularizers specified by 
Bregman divergences [16]. Moreover, as discussed in Section 3.3, SN- Train can be generalized in various ways to 
allow parallel and robust sensor update rules. 

5.3. Future work 

Future work will consider the extensions mentioned above and will analyze the statistical behavior of SN- Train. 
In particular, we focus on understanding the design of regularization parameters that yield stable and consistent 
estimates. Such work will rely on the extensive literature on the statistics of centralized kernel regression, as 
well as on various analyses of alternating projection methods. 
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